Abstract

We consider a general lattice model of a finite protein in its environment and calculate its Boltzmann entropy $S(E)$ as a function of its energy $E$ in a microcanonical ensemble, and its Gibbs entropy $\bar{S}(\bar{E})$ as a function of its average energy $\bar{E}$ in a canonical ensemble, by exact enumeration on a square lattice. We find that because of the finite size of the protein, (i) the two are very different and $\bar{S}(\bar{E}) > S(E)$, (ii) $S(E)$ need not be concave while $\bar{S}(\bar{E})$ is, and (iii) $\bar{S}(\bar{E})$ is relevant for experiments but not $S(E)$, even though $S(E)$ is conceptually more useful. We discuss the consequences of these differences. The results are general and applicable to all finite systems.



Self-assembling small proteins are a prime example of small systems, and can fold into their native states (of minimum free energy) without any chaperones. They have been extensively investigated recently using lattice models by thermodynamic principles [1]. The smallest known natural protein is Trp-Cage, derived from the saliva of Gila monsters, which has only 20 residues. A first-principles study of such proteins requires short-ranged model energetics that, while remaining independent of the thermodynamic state of the protein such as its conformation, temperature $T$, pressure $P$, etc., determine the native state(s), and that have to be judiciously chosen to give a unique and correct native state. It should be stressed that proteins in Nature are never isolated but always occur in an environment such as a cell controlled by the temperature $T$. Thus, the proper way to study proteins is to consider the canonical ensemble (CE), and not the microcanonical ensemble (ME). Moreover, the two ensembles are most probably not equivalent for a finite system. Despite this, investigations using ME are very common for proteins. Therefore, their predictions must be carefully examined and compared with those from CE, keeping in mind their possible non-equivalence. Unfortunately, this does not seem to be practiced in the field, which, as we will establish here, may be quite dangerous for finite systems such as proteins.

The ME entropy is given by the Boltzmann relation $S(E) \equiv \ln W(E)$, where $W(E)$ is the number of protein conformations of energy $E$. Since folding is a conformational change into the native state, the conformational entropy $S(E)$ is believed to play a central role in determining the way folding occurs into compact native states along a very large number of microscopic pathways that connect them to myriad unfolded conformations. It also characterizes the potential energy landscape [2-4]. It is a well-established tenet of macroscopic thermodynamics that $W(E)$ decreases with falling energy $E$ as folding proceeds ($\partial S/\partial E > 0$); consequently, the energy landscape is expected to possess a structure that narrows down with falling energy, such as a funnel [5]. It is known that the entire thermodynamics is contained in $S(E)$, which must be concave [6] for a macroscopic system. This concavity is built-in in the random energy model [7], which has been extensively employed for proteins; see for example, which also shows that the energy gap above the ground state (lowest energy state) is crucial for foldability. The resulting lack of concavity has been used as a sign of a first-order folding transition for small proteins by several workers. It should be noted that there are other idealized physical models such as the KDP model that freeze into the ground state at finite $T$ due to a similar gridlock [8,9].

A proper model of a protein should satisfy certain principles [10], one of which is the requirement of cooperativity. The sequence of residues also plays an important role in determining the native state [11]. However, there is no consensus for general energetics to describe all proteins, and there remains a certain amount of freedom in the choice, at least in modeling. It is widely recognized that secondary structures are also important in the folding process [12]. The simplest model is the standard model, which classifies the 20 different residues into two, H (hydrophobic) and P (hydrophilic), and allows only a nearest-neighbor attractive HH interaction $e_{\rm HH}$ (set $= -1$ in some predetermined unit) to provide good hydrophobic cores [12]; however, consideration of local energetics of the 20 kinds of residues [13] is also common. It is found that the introduction of multi-body interactions enhances cooperativity [14], so they should not be neglected. It is important, therefore, to investigate the effects of the energetics on the form of $S(E)$, which to the best of our knowledge have not been studied carefully.

The direct experimental approaches (primarily X-ray crystallography or NMR spectroscopy) to determine energetics require information about the typical conformation associated with the average energy $\bar{E}(T)$. Thus, CE must be used to determine the dependence of the canonical entropy $\bar{S}(T)$ on $\bar{E}$, given by the Gibbsian relation $\bar{S}(T) \equiv -\sum_{\Gamma} p(\Gamma,T) \ln p(\Gamma,T)$, where $p(\Gamma,T)$ is the probability to be in the conformation $\Gamma$ at $T$. For a macroscopic system, $S(E)$ and $\bar{S}(T)$ are the same so that $\bar{S}(T)$ allows us to identify conformations of average energy $\bar{E}$. Their equality is crucial for the direct experimental approaches in which conformations associated with $\bar{E}$ need to be identified as typical. Thus, it is also important to verify if the two entropies are the same for finite proteins that are of interest here. If they are not, the interpretation of experimental data for the energetics could be incorrect. This will become a limitation of the direct experimental techniques.

FIG. 1: A 2-d model of a finite protein on a square lattice. The red spheres represent hydrophobic sites and the blue spheres represent hydrophilic sites.

Model. The interplay of intra-protein molecular interactions, the interaction with the surroundings, and the residue sequence to give rise to the folded native state is quite intricate and far from a basic understanding at present; much remains to be understood. It has been argued that conflicts among interactions also play a significant role in folding [15]. In general, the model should contain various interactions relevant not only for protein folding and various secondary substructures like helix formation in the native state, but also for proteins considered as semi-flexible heteropolymers [16] with certain specific sequences [17]. The model should also contain solvation effects, as all protein activity occurs in the presence of water or solvent. In this work, we use such a model, which has been investigated by us recently in different limits, one of which is the standard model described above. Here, we only report some unexpected results for finite proteins, which have not been noted earlier to the best of our knowledge. It should, however, be recognized that finite proteins cannot undergo a sharp folding transition. This issue is not relevant as we are only interested in comparing $S(E)$ and $\bar{S}(\bar{E})$.

We consider a protein with $M$ residues in a given sequence on a square lattice, with one of its ends fixed at the origin so that the total number of conformations $W$ for a finite protein remains finite even on an infinite lattice. We generalize a recent model [18], in which the number of bends $N_{\rm b}$, pairs of parallel bonds $N_{\rm p}$, and hairpin turns $N_{\rm hp}$ characterize the semiflexibility; see Fig. 1, where we show a protein in its compact form so that all the solvent molecules (W) such as water are expelled from the inside and surround the protein. We do not allow any free volume. The red spheres denote hydrophobic residues (H) and blue spheres denote hydrophilic (i.e., polar) residues (P). The nearest-neighbor distinct pairs PP, HH, HP, PW and HW between the residues and the water are also shown, but not the contact WW. Only three out of these six contacts are independent on the lattice [19], which we take to be the HH, HW, and HP pairs. A bend is where the protein deviates from its collinear path. Each hairpin turn requires two consecutive bends in the same (clockwise or counterclockwise) direction; see Fig. 1. Two parallel bonds one lattice spacing apart form a pair (p). We also consider the number of helical turns $N_{\rm hl}$. On a square lattice, a "helical turn" is interpreted as two consecutive hairpin turns in opposite directions; see Fig. 1. The corresponding energies are $e_{\rm b}$, $e_{\rm p}$, $e_{\rm hp}$, and $e_{\rm hl}$, respectively. The pair interaction energies are $e_{\rm HH}$ $(= -1)$, $e_{\rm HW}$, and $e_{\rm HP}$, and the pair numbers are $N_{\rm HH}$, $N_{\rm HW}$, and $N_{\rm HP}$, corresponding to the HH, HW, and HP pairs, respectively. We let $\mathbf{e}$ denote the entire ordered set $\{e_i\}$, with $i$ ordered as b, p, hp, hl, HH, HW, and HP, and $\mathbf{e}'$ the ordered set $\{e_i\}$ excluding $e_{\rm HH}$ $(= -1)$. Similarly, $\mathbf{N} \equiv \mathbf{N}(\Gamma) = \{N_i(\Gamma)\}$, and $\mathbf{N}'$ denotes all $\{N_i\}$ but $N_{\rm HH}$. The three energy choices we have made most often are: (A) $\mathbf{e}' = 0$, (B) $\mathbf{e}' = (a, -a, -2a, -a, 25a, 5a)$, $a = 1/50$ $(\ll 1)$, (C) $\mathbf{e}' = (b, -b, -b, -b, 2b, b)$, $b = 1/3$ $(\sim 1)$. The standard model is (A). In the model (B), we have most other interactions much weaker than $|e_{\rm HH}|$, while they are comparable to $|e_{\rm HH}|$ in the model (C). Thus, (B) is closer to (A) than (C) is. Despite this, we will see that (B) and (C) behave very differently from (A). It should be noted that $W$ does not depend on the model; it is its partition into $W(E)$ that depends on the model. Thus, the shape of the energy landscape changes from model to model, but not its total "area", which is given by $W$ [5]. We have also considered random, ordered and fixed sequences. We consider compact and unconstrained protein conformations [18] separately. We have found that in the majority of cases that we have investigated, the sequence containing a repetition of PPHH gives rise to the lowest energy or very close to it. The energy of a given conformation $\Gamma$ is



$$E(\Gamma) = \mathbf{e}\cdot\mathbf{N}(\Gamma) \equiv \sum_i e_i N_i(\Gamma). \qquad (1)$$
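To make the bookkeeping behind Eq. (1) concrete, the following is a minimal Python sketch of an exact enumeration, restricted to the standard model (A), for which only the HH contact number contributes and the energy of a conformation is minus that number. It is an illustration under stated assumptions and not the code used for the results reported here: the function names are hypothetical, only unrestricted self-avoiding conformations are generated, the semiflexibility counts (bends, parallel pairs, hairpin and helical turns) and solvent contacts are omitted, and walks related by a lattice symmetry are counted separately because one end is simply fixed at the origin.

```python
from collections import Counter

# Unit steps on the square lattice.
STEPS = ((1, 0), (0, 1), (-1, 0), (0, -1))


def count_hh_contacts(path, sequence):
    """Number of nearest-neighbor HH pairs that are not bonded along the chain."""
    index_at = {site: i for i, site in enumerate(path)}
    n_hh = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        # Look only to the right and up so that each lattice pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                n_hh += 1
    return n_hh


def enumerate_multiplicity(sequence):
    """Return a Counter {E: W(E)} for the standard model, where E = -N_HH.

    The first residue is fixed at the origin and every self-avoiding
    conformation of the chain is generated by depth-first growth.
    """
    multiplicity = Counter()
    path = [(0, 0)]
    occupied = {(0, 0)}

    def grow():
        if len(path) == len(sequence):
            multiplicity[-count_hh_contacts(path, sequence)] += 1
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            site = (x + dx, y + dy)
            if site not in occupied:
                path.append(site)
                occupied.add(site)
                grow()
                path.pop()
                occupied.remove(site)

    grow()
    return multiplicity
```

For example, `enumerate_multiplicity("PPHHPPHH")` returns the multiplicities of a short chain built from the PPHH repeat. The running time grows exponentially with the chain length, so this sketch is only practical for short chains.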



We partition $W$ according to $\mathbf{N}$ or $E$, so that $W \equiv \sum_{\mathbf{N}} W(\mathbf{N}) \equiv \sum_{E} W(E)$, where $W(\mathbf{N})$ [or $W(E)$] is the number of conformations for a given set $\mathbf{N}$ [or $E$]. On a lattice, $E$ remains a discrete variable, but this fact is not important for our final conclusions, as we will discuss below. In the standard model, $E = -N_{\rm HH}$. It is clear from



(2) 



that the entropy $S(N_{\rm HH}) \equiv \ln W(N_{\rm HH})$ for a given $N_{\rm HH}$, regardless of $\mathbf{N}'$, is maximum in the standard model [20] and provides a possible justification of the observation made in [14]. A protein with a given $N_{\rm HH}$ will probe many more states in the standard model, where there is no energetic penalty to explore all possible $\mathbf{N}'$, than in any other model with an energetic penalty, which then slows down its approach to the native state. Thus, it is important to have non-zero $\mathbf{e}'$ to speed up the approach to the native state. (It is highly likely that the native states in different models are different, but this does not affect the above conclusion.) There is another important consequence of $\mathbf{e}' = 0$. The fluctuations in the corresponding $N_i$ are maximum, as there is no penalty no matter what $\mathbf{N}'$ is. The protein will spend a lot of time probing a large number of conformations corresponding to the maximum fluctuations in $\mathbf{N}'$. This also suggests that we need to go beyond the standard model to describe proteins that fold fast.

The canonical probability distribution for $\Gamma$ is $p(\Gamma,T) = e^{-\beta E(\Gamma)}/Z(T)$, where

$$Z(T) \equiv \sum_{\Gamma} e^{-\beta E(\Gamma)} = \sum_{E} W(E)\, e^{-\beta E}, \qquad (3)$$

the partition function, describes the finite protein thermodynamics; here, $\beta \equiv 1/T$ (we set the Boltzmann constant $k_{\rm B} = 1$). The distribution $p(\Gamma,T)$ can be used to define the average $\langle\,\rangle$ of any thermodynamic quantity (also denoted by an overbar in the following) such as $\bar{\mathbf{N}}(T) \equiv \langle \mathbf{N} \rangle$ and $\bar{E}(T) \equiv \langle E \rangle = \mathbf{e}\cdot\bar{\mathbf{N}}(T)$ [$\bar{e} \equiv \bar{E}/M$]. The free energy $F(T) \equiv -T \ln Z(T)$ gives the canonical entropy $\bar{S}(T) \equiv -\partial F(T)/\partial T$, which satisfies the conventional thermodynamic relation $F(T) = \bar{E}(T) - T\bar{S}(T)$, and the Gibbsian relation quoted above, as can be easily checked. Both $\bar{S}(T)$ and $\bar{E}(T)$ are continuous functions (except possibly at a phase transition, which is not relevant here as we are dealing with a finite protein) of the continuous variable $T$. Moreover, $F(T)$ is monotonically decreasing with $T$ as expected.
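The relations in this paragraph can be checked numerically for any given set of multiplicities. The sketch below is a minimal Python illustration, with hypothetical function names and the Boltzmann constant set to 1: it evaluates the partition function of Eq. (3) in logarithmic form, the average energy, the free energy and the canonical entropy, and separately evaluates the Gibbsian sum over conformations so that the two routes to the entropy can be compared. The small example table corresponds to the four-residue chain HPPH in the standard model, for which the 36 conformations split into 8 with one HH contact and 28 with none.

```python
import math


def canonical_quantities(multiplicity, temperature):
    """Return (ln Z, E_bar, S_bar, F) at temperature T > 0, with k_B = 1.

    `multiplicity` maps each allowed energy E to W(E).
    """
    beta = 1.0 / temperature
    e_min = min(multiplicity)
    # Factor exp(-beta * e_min) out of Z so that the sum stays finite at low T.
    weights = {e: w * math.exp(-beta * (e - e_min)) for e, w in multiplicity.items()}
    z_shifted = sum(weights.values())
    log_z = math.log(z_shifted) - beta * e_min
    e_bar = sum(e * wt for e, wt in weights.items()) / z_shifted
    free_energy = -temperature * log_z
    s_bar = (e_bar - free_energy) / temperature  # from F = E_bar - T * S_bar
    return log_z, e_bar, s_bar, free_energy


def gibbs_entropy(multiplicity, temperature):
    """Evaluate -sum_Gamma p ln p directly, grouping conformations by energy."""
    beta = 1.0 / temperature
    log_z = canonical_quantities(multiplicity, temperature)[0]
    total = 0.0
    for e, w in multiplicity.items():
        log_p = -beta * e - log_z  # ln p for a single conformation of energy e
        total -= w * math.exp(log_p) * log_p
    return total


# W(E) for the four-residue chain HPPH in the standard model (36 walks in all).
W_HPPH = {-1: 8, 0: 28}
```

Calling `canonical_quantities(W_HPPH, 1.0)` and `gibbs_entropy(W_HPPH, 1.0)` gives the same entropy up to rounding, and lowering the temperature lowers the average energy, in line with the statements above.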

Since the derivative $\partial\bar{E}/\partial T$ is non-negative, as can be easily checked, $\bar{E}(T)$ can be inverted to express $T$ as a function $T(\bar{E})$, which then allows us to express $\bar{S}(T)$ as an explicit function $\bar{S}(\bar{E}) \equiv \bar{S}[T(\bar{E})]$ of $\bar{E}$. The entropy $\bar{S}(\bar{E})$ can be thought of as the canonical equivalent of the microcanonical entropy $S(E)$. However, they are two different quantities for finite proteins. In the first place, $S(E)$ is a discrete function since $E$ is discrete, while $\bar{S}(\bar{E})$ is a continuous function since $\bar{E}$ is continuous. In the second place, $\bar{S}(\bar{E}) \geq S(E)$, the equality holding as $M \to \infty$ [8]. To demonstrate this, let us assume that $\bar{E} = E$ is one of the energies in the sum in (3). We then rewrite $\bar{S}(\bar{E}) = \ln Z + \bar{E}/T$, and evaluate $\bar{W}(\bar{E}) \equiv \exp[\bar{S}(\bar{E})]$:

$$\bar{W}(\bar{E}) = W(\bar{E}) + \sum_{E \neq \bar{E}} W(E)\, e^{-\beta(E - \bar{E})}; \qquad (4)$$

hence, $\exp[\bar{S}(\bar{E})] \geq \exp[S(\bar{E})] = W(\bar{E})$, as asserted above. The difference between them is due to the non-negative last term in (4), which vanishes as $M \to \infty$. In case $\bar{E}$ is not one of the energies in the sum, we can use a suitable interpolation to define $W(\bar{E})$, without affecting the conclusion [18]. The above proof does not depend on the discrete nature of the energies in ME; thus, it is also valid for continuum models. We show in Fig. 2 the exactly enumerated entropies per residue $s(e) \equiv (1/M)S(E)$ (red curve as a guide through discrete points) and $\bar{s}(\bar{e}) \equiv (1/M)\bar{S}(\bar{E})$ (blue curve) for the model (B) ($M = 24$; unrestricted conformations) as a function of the discrete variable $e \equiv E/M$ or $\bar{e}$. In addition, we also see a distinct band structure in $s(e)$ that gives rise to regions of non-concavity f^, which is related to the nature of the interactions and has no implication for any phase transition, as we will discuss below.

FIG. 2: Continuous $\bar{s}(\bar{e})$ (blue and red curves), and discrete $s(e)$ (blue and red points) for a given sequence ($M = 24$, unrestricted). The bands in $s$ become more pronounced and their separations decrease as $M$ increases. Note a clear band in $s$ at low energies and the native state, disjoint from the rest of the bands.
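The inequality between the two entropies can be checked directly on any table of multiplicities. The following minimal Python sketch, with hypothetical function names and a made-up table `TOY_W` that is not taken from our enumeration, finds for each allowed energy the inverse temperature at which the canonical average energy equals that energy (by bisection, using the monotonicity of the average energy), and then compares the canonical entropy, written as the logarithm of the partition function plus the average energy divided by the temperature, with the logarithm of the multiplicity. Only energies strictly between the ground-state energy and the infinite-temperature average can be reached at positive temperature, so only those are compared.

```python
import math


def log_z(multiplicity, beta):
    """ln Z(beta), with exp(-beta * E_min) factored out for numerical safety."""
    e_min = min(multiplicity)
    shifted = sum(w * math.exp(-beta * (e - e_min)) for e, w in multiplicity.items())
    return math.log(shifted) - beta * e_min


def mean_energy(multiplicity, beta):
    """Canonical average energy E_bar at inverse temperature beta >= 0."""
    e_min = min(multiplicity)
    weights = {e: w * math.exp(-beta * (e - e_min)) for e, w in multiplicity.items()}
    return sum(e * wt for e, wt in weights.items()) / sum(weights.values())


def beta_for_mean_energy(multiplicity, target, beta_lo=1e-9, beta_hi=200.0):
    """Solve E_bar(beta) = target by bisection; E_bar decreases as beta grows."""
    for _ in range(200):
        beta_mid = 0.5 * (beta_lo + beta_hi)
        if mean_energy(multiplicity, beta_mid) > target:
            beta_lo = beta_mid
        else:
            beta_hi = beta_mid
    return 0.5 * (beta_lo + beta_hi)


def entropy_comparison(multiplicity):
    """Return [(E, S(E), S_bar at E_bar = E)] for energies reachable at T > 0."""
    e_min = min(multiplicity)
    e_inf = mean_energy(multiplicity, 0.0)  # infinite-temperature average energy
    rows = []
    for energy in sorted(multiplicity):
        if not (e_min < energy < e_inf):
            continue
        beta = beta_for_mean_energy(multiplicity, energy)
        s_bar = log_z(multiplicity, beta) + beta * energy  # S_bar = ln Z + E_bar / T
        rows.append((energy, math.log(multiplicity[energy]), s_bar))
    return rows


# Hypothetical multiplicities W(E), chosen only for illustration.
TOY_W = {-4: 1, -3: 6, -2: 30, -1: 120, 0: 400}
```

In every row returned by `entropy_comparison(TOY_W)` the canonical entropy exceeds the microcanonical one, as Eq. (4) requires, because the canonical multiplicity collects weighted contributions from all energies and not only from the one that equals the average.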

It is easily seen that the canonical entropy function satisfies the conventional thermodynamic relation [8]

$$\partial \bar{S}(\bar{E})/\partial \bar{E} = 1/T, \qquad (5)$$

and is, therefore, concave ($\partial^2 \bar{S}(\bar{E})/\partial \bar{E}^2 < 0$) [6]. On the other hand, the microcanonical entropy need not be concave; see Fig. 2, where the bands seen in $s(e)$ have both positive and negative slopes, which is in contradiction with (5) valid for $\bar{s}(\bar{e})$. The non-concave $S(E)$ does not violate finite system thermodynamics. The canonical entropy is the physical entropy for a protein in its environment and remains concave in Fig. 2, as required by thermodynamic stability.
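For completeness, the step from (5) to concavity can be written out explicitly. The short derivation below is a sketch that assumes $k_{\rm B} = 1$, a finite temperature $T > 0$, and at least two distinct allowed energies; the angular brackets denote the canonical average defined through (3).

```latex
\begin{align}
\frac{\partial \bar{E}}{\partial T}
  &= \frac{\partial}{\partial T}\,
     \frac{\sum_E E\,W(E)\,e^{-E/T}}{\sum_E W(E)\,e^{-E/T}}
   = \frac{\langle E^2\rangle - \langle E\rangle^2}{T^2} \;\ge\; 0, \\
\frac{\partial^2 \bar{S}}{\partial \bar{E}^2}
  &= \frac{\partial}{\partial \bar{E}}\!\left(\frac{1}{T}\right)
   = -\frac{1}{T^2}\left(\frac{\partial \bar{E}}{\partial T}\right)^{-1}
   = -\frac{1}{\langle (E-\bar{E})^2\rangle} \;<\; 0 .
\end{align}
```

The curvature of the canonical entropy is thus minus the inverse of the energy variance, which is positive and finite under the stated assumptions, so the concavity does not rely on the thermodynamic limit.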

To understand the absence of concavity, we first consider the model (A). In all cases that we have studied f\^, $S(E) \equiv S(N_{\rm HH})$ is found to be concave. The number of states $W(N_{\rm HH})$ can be partitioned into $W(N_{\rm HH}, \mathbf{N}')$; see (2). In the model (B), $\mathbf{e}' \simeq 0$; therefore, most of the conformations in $W(N_{\rm HH})$ have energies that are close to $-N_{\rm HH}$; some of them will have energies that are outside the range $(-N_{\rm HH} - 1, -N_{\rm HH} + 1)$. The resulting $S(E)$ associated with this $N_{\rm HH}$ is almost concave, as seen in each band in Fig. 2. This then gives rise to the lack of concavity in the region where two nearby bands overlap. The number of bands equals the number of possible values of $N_{\rm HH}$ in the model (A). These convex portions of $s(e)$ disappear and $s(e)$ approaches $\bar{s}(\bar{e})$ from below as $M \to \infty$ [8]. But for finite systems, the convex regions persist. The band structure persists for all sequences that we have checked. The additional energies in the model (C) provide enough spread for bands to overlap; this reduces the size of the convex regions. Even here, we have found that the band nature survives at the upper and the lower ends of the energy range. Thus, we are confident that convex regions in $S(E)$ will exist in any realistic model of a protein. Their presence, however, does not imply any phase transition, as $\bar{S}(\bar{E})$ is always concave. This is true even though we note from Fig. 2 that there is a clear gap at the lowest energy. The energy gap causes convexity in $S(E)$, but not in $\bar{S}(\bar{E})$.
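The way separated bands produce convex regions can be illustrated with a simple local test on a discrete table of multiplicities. The Python sketch below uses a hypothetical function name and hypothetical band data that are not taken from Fig. 2: it flags every interior energy at which the logarithm of the multiplicity lies strictly below the straight line joining its two neighboring energies, which is the discrete signature of a locally convex portion on a possibly non-uniform energy grid.

```python
import math


def nonconcave_energies(multiplicity, tolerance=1e-12):
    """Energies where ln W lies strictly below the chord through its two neighbors."""
    energies = sorted(multiplicity)
    flagged = []
    for left, mid, right in zip(energies, energies[1:], energies[2:]):
        s_left = math.log(multiplicity[left])
        s_mid = math.log(multiplicity[mid])
        s_right = math.log(multiplicity[right])
        # Linear interpolation between the neighbors, evaluated at the middle energy.
        chord = s_left + (s_right - s_left) * (mid - left) / (right - left)
        if s_mid < chord - tolerance:
            flagged.append(mid)
    return flagged


# Two hypothetical narrow bands centered near E = -1 and E = 0, each concave by itself.
TOY_BANDS = {-1.04: 2, -1.0: 10, -0.96: 3, -0.04: 20, 0.0: 100, 0.04: 30}

# A single hypothetical band-free table with steadily decreasing slope.
TOY_CONCAVE = {0: 1, 1: 4, 2: 9, 3: 12}
```

For `TOY_BANDS` the test flags the two energies at the facing edges of the bands, where the entropy dips before rising into the next band, while `TOY_CONCAVE` yields no flagged energy. This is only a local check; it does not construct the concave envelope of the entropy.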

Because of conformational changes during folding, the folding is believed to be governed by the multiplicity $W(E)$, which in turn governs the energy landscape, for which $W(E)$ represents the "surface area" of the hypersurface of the landscape at energy $E$: each point on the hypersurface represents a conformation. The lack of concavity discovered here has a profound effect on the shape of the landscape. It no longer narrows down as $E$ decreases. It will be interesting to pursue the consequences of this shape modification. This is beyond the scope of the present work, but we hope to consider it elsewhere. As is evident, and as discussed above, several different $\mathbf{N}$ will usually mix together for a given $E$, except in the model (A), in which $E = -N_{\rm HH}$ so that $W(E) = W(N_{\rm HH})$. There will be a certain landscape topology for the standard model, which will change with $\mathbf{e}'$. From (2), it is evident that the landscape will become drastically narrower for $\mathbf{e}' \neq 0$. The total "surface area" $W$ of the landscape does not change with $\mathbf{e}'$, even though the allowed energies change and become closer. The landscape narrowing and closeness of energies at constant $W$ make the approach to the native state presumably i^ more directional and fast.

Since it is CE that is relevant for a real protein in its environment, it is the canonical multiplicity $\bar{W}(\bar{E})$ that is relevant for folding. As shown above, it continuously increases with $\bar{E}$, until we reach infinite temperature. Thus, the narrowing of the landscape with non-zero $\mathbf{e}'$ may not be as relevant for protein folding as the observation that $\bar{W}(\bar{E}) \geq W(E)$. From (4), we observe that $\bar{W}(\bar{E})$ gets contributions from all conformations, not just the conformations in $W(\bar{E})$. Thus, it may be misleading to think that a finite protein at a given $T$ only probes some typical conformations of average energy $\bar{E}$. (It is possible that $\bar{E}$ may not even be an allowed energy $E$.) It also probes the native state, though its probability is going to be small. As $T$ is reduced, this probability increases. In addition, the protein restricts its search to effectively a smaller set of conformations, closer in energy. It would be interesting to follow the consequence(s) of this observation.

In conclusion, we observe that the microcanonical entropy, which dictates the form of the energy landscape, does not satisfy concavity; however, this violation does not imply any impending phase transition; the latter requires investigating the behavior of the canonical entropy, which always satisfies concavity. However, nearly all works on protein thermodynamics have paid no attention to this issue. This may be dangerous. The most surprising result is the tremendous difference between the two entropies: the canonical entropy is almost twice as big as the microcanonical entropy at intermediate energies, but much larger at low energies. Its implication for experimental data interpretation, as noted above, needs to be further pursued.